Abstract

Hash tables are one of the most fundamental data structures in computer science, in both theory and
practice. They are especially useful in external memory, where their query performance approaches the
ideal cost of just one disk access. Knuth [13] gave an elegant analysis showing that with some simple
collision resolution strategies such as linear probing or chaining, the expected average number of disk
I/Os of a lookup is merely $1 + 1/2^{\Omega(b)}$, where each I/O can read a disk block containing $b$ items. Inserting
a new item into the hash table also costs $1 + 1/2^{\Omega(b)}$ I/Os, which is again almost the best one can do
if the hash table is entirely stored on disk. However, this assumption is unrealistic since any algorithm
operating on an external hash table must have some internal memory (at least $\Omega(1)$ blocks) to work
with. The availability of a small internal memory buffer can dramatically reduce the amortized insertion
cost to $o(1)$ I/Os for many external memory data structures. In this paper we study the inherent
query-insertion tradeoff of external hash tables in the presence of a memory buffer. In particular, we show that
for any constant $c > 1$, if the query cost is targeted at $1 + O(1/b^c)$ I/Os, then it is not possible to support
insertions in less than $1 - O(1/b^{(c-1)/4})$ I/Os amortized, which means that the memory buffer is essentially
useless. If instead the query cost is relaxed to $1 + O(1/b^c)$ I/Os for any constant $c < 1$, there is a simple
dynamic hash table with $o(1)$ insertion cost. These results also answer the open question recently posed
by Jensen and Pagh [12].
1 Introduction



Hash tables are the most efficient way of searching for a particular item in a large database, with constant
query and update times. They are arguably one of the most fundamental data structures in computer science,
due to their simplicity of implementation, excellent performance in practice, and many nice theoretical
properties. They work especially well in external memory, where the storage is divided into disk blocks,
each containing up to $b$ items. Thus collisions happen only when more than $b$ items are hashed into the
same location. Using common collision resolution strategies such as linear probing or chaining, the
expected average cost of a successful lookup in an external hash table is merely $1 + 1/2^{\Omega(b)}$ disk accesses (or
simply I/Os), provided that the load factor $\alpha$ (see footnote 1) is less than a constant smaller than 1. The
expectation is with respect to the random choice of the hash function, while the average is with respect to
the uniform choice of the queried item. An unsuccessful lookup costs slightly more, but the same as a
successful lookup if we ignore the constant in the big-Omega. Knuth [13] gave an elegant analysis deriving
the exact formula for the query cost as a function of $\alpha$ and $b$. As typical values of $b$ range from a few
hundred to a thousand, the query cost is extremely close to just one I/O; some exact numbers are given in
[13, Section 6.4].

Inserting or deleting an item from the hash table also costs $1 + 1/2^{\Omega(b)}$ I/Os: We simply first read
the target block where the new item should go, then write it back to disk (see footnote 2). To maintain the
load factor, we can periodically rebuild the hash table using schemes like extensible hashing [10] or linear
hashing [14], but this only adds an extra cost of $O(1/b)$ I/Os amortized. Jensen and Pagh [12] demonstrate
how to maintain the load factor at $\alpha = 1 - O(1/b^{1/2})$ while still supporting queries in $1 + O(1/b^{1/2})$ I/Os and
updates in $1 + O(1/b^{1/2})$ I/Os. Indeed, one cannot hope for lower than 1 I/O for an insertion, if the hash table
must reside on disk entirely and there is no space in main memory for buffering. However, this assumption
is unrealistic, since an algorithm operating on an external hash table has to have at least a constant number
of blocks of internal memory to work with. So we must include a main memory of size $m$ in our setting
to model the problem more accurately. In fact, this is exactly what the standard external memory model
[1] depicts: The system has a disk of infinite size partitioned into blocks of size $b$, and a main memory
of size $m$. Computation can only happen in main memory, which accesses the disk via I/Os. Each I/O
can read or write a disk block, and the complexity is measured by the number of I/Os performed by an
algorithm. The presence of a no-cost main memory changes the problem dramatically, since it can be used
as a buffer space to batch up insertions and write them to disk periodically, which could significantly reduce
the amortized insertion cost. The abundant research in the area of I/O-efficient data structures has witnessed
this phenomenon numerous times, where the insertion cost can typically be brought down to only slightly
larger than $O(1/b)$ I/Os. Examples range from the simplest structures like stacks and queues to more advanced
ones such as the buffer tree [2] and the priority queue [4, 9]. Many of these results hold as long as the buffer
has just a constant number of blocks; some require a larger buffer of $\Omega(b)$ blocks (known as the tall cache
assumption). Please see the surveys [3, 18] for a complete account of the power of buffering.

Therefore the natural question is: can we lower the insertion cost of a dynamic hash table
by buffering without sacrificing its near-perfect query performance? Interestingly, Jensen and Pagh [12]
recently posed the same question, and conjectured that the insertion cost must be $\Omega(1)$ I/Os if the query cost
is required to be $O(1)$ I/Os.

1 The load factor is defined to be the ratio between the minimum number of blocks required to store $n$ data records, $\lceil n/b \rceil$, and the
actual number of blocks used by the hash table.

2 Rigorously speaking, this is $2 + 1/2^{\Omega(b)}$ I/Os, but since disk I/Os are dominated by the seek time, writing a block immediately
after reading it can be considered as one I/O.
Our results. In this paper, we confirm that the conjecture of Jensen and Pagh [12] is basically correct but
not accurate enough. Specifically we obtain the following results. Consider any dynamic hash table that
supports insertions in expected amortized $t_u$ I/Os and answers a successful lookup query in expected $t_q$ I/Os
on average. We show that if $t_q \le 1 + O(1/b^c)$ for any constant $c > 1$, then we must have $t_u \ge 1 - O(1/b^{(c-1)/4})$.
This is only an additive term of $O(1/b^{(c-1)/4})$ away from how the standard hash table supports insertions,
which means that buffering is essentially useless in this case. However, if the query cost is relaxed to
$t_q \le 1 + O(1/b^c)$ for any constant $0 < c < 1$, we present a simple dynamic hash table that supports insertions
in $t_u = O(b^{c-1}) = o(1)$ I/Os. For this case we also present a matching lower bound of $t_u = \Omega(b^{c-1})$.
Finally for the case $t_q = 1 + \Theta(1/b)$, we show a tight bound of $t_u = \Theta(1)$. Our results are pictorially
illustrated in Figure 1, from which we see that we now have an almost complete understanding of the entire
query-insertion tradeoff, and $t_q = 1 + \Theta(1/b)$ seems to be the sharp boundary separating effective and
ineffective buffering. We prove our lower bounds for the three cases above using a unified framework in
Section 2. The upper bound for the first case is simply the standard hash table following [13]; we give the
upper bounds for the other two cases in Section 3.
Figure 1: The query-insertion tradeoff.



In this paper we only consider the query-insertion tradeoff for the following reasons. First, our primary
interest is in the lower bound, and a query-insertion tradeoff lower bound certainly applies to the
query-update tradeoff for more general updates that include both insertions and deletions. Second, there
tend to be many more insertions than deletions in many practical situations, such as managing archival data. For
similar reasons we only consider the query cost to be that of a successful lookup.

Let $h(x)$ be a hash function that maps an item $x$ to a hash value between $0$ and $u - 1$. In our lower
bound construction, we will insert a total of $n$ independent items such that each $h(x)$ is uniformly randomly
distributed between $0$ and $u - 1$, and we prove a lower bound on the expected amortized cost per insertion,
under the condition that at any time, the hash table must be able to answer a query for the already inserted
items within the desired expected average query bound. Thus, our lower bound holds even assuming that
$h(x)$ is an ideal hash function that maps each item to a hash value independently and uniformly at random, a
justifiable assumption [15] made in many works on hashing. Also note that since the input is drawn
uniformly at random, it is sufficient to consider only deterministic algorithms, as randomization will not
help any further.
When proving our lower bound, our only requirement is that items must be treated as atomic
elements, i.e., they can only be moved or copied between memory and disk in their entirety, and when
answering a query, the query algorithm must visit the block (in memory or on disk) that actually contains
the item or one of its copies. Such an indivisibility assumption is also made in the sorting and permuting
lower bounds in external memory [1]. We assume that each machine word consists of $\log u$ bits and each
item occupies one machine word. A block has $b$ words and the memory stores up to $m$ words. We assume
that each block is not too small: $b > \log u$. Our lower and upper bounds hold for the wide range of
parameters $\Omega(b^{1+2c}) < n/m < 2^{o(b)}$. Finally, we comment that our lower bounds do not depend on the load
factor, which implies that the hash table cannot do better by consuming more disk space.

Related results. Hash tables are widely used in practice due to their simplicity and excellent performance.
Knuth's analysis [13] applies to the basic version where $h(x)$ is assumed to be an ideal random hash function
and $t_q$ is the expected average cost. Since then, much work has been done to give better theoretical
guarantees, for instance removing the ideal hash function assumption [7] or making $t_q$ worst-case [8, 11,
17]. Please see [16] for a survey on hashing techniques. Lower bounds have been sparse because in
internal memory, the update time cannot be lower than $\Omega(1)$, which is already achieved by the standard hash
table. Only with some strong requirements, e.g., when the algorithm is deterministic and $t_q$ is worst-case,
can one obtain nontrivial lower bounds on the update time [8]. Our lower bounds, on the other hand,
hold for randomized algorithms and do not need $t_q$ to be worst-case.

As commented earlier, in external memory there is a trivial lower bound of 1 I/O for either a query or
an update, if all the changes to the hash table must be committed to disk after each update. However, the
vast body of work on external memory algorithms has never made such a requirement. Indeed,
for many problems, the availability of a small internal memory buffer can significantly reduce the
amortized update cost without affecting the query cost [2-4, 9, 18]. Unfortunately, little is known about the
inherent limit of what buffering can do. The only nontrivial lower bound on the update cost of any external
data structure with a memory buffer is that of Fagerberg and Brodal [6], who gave a query-insertion
tradeoff for the predecessor problem in a natural external version of the comparison model, a model much
more restrictive than the indivisibility model we use. Since a comparison-based model precludes any
hashing techniques, their techniques are inapplicable to the problem at hand. To the best of our
knowledge, no nontrivial lower bound on external hashing of any kind is known.

2 Lower Bounds 

To obtain a query-insertion tradeoff, we start with an empty hash table and insert a total of $n$ independent
items such that $h(x)$ is uniformly randomly distributed in $U = \{0, \ldots, u - 1\}$. We will derive a lower bound
on $t_u$, the expected amortized number of I/Os for an insertion, while assuming that the hash table is able to
answer a successful query in $t_q$ I/Os on average in expectation after the first $i$ items have been inserted, for
all $i = 1, \ldots, n$. We assume that all the $h(x)$'s are different, which happens with probability $1 - O(1/n)$
as long as $u > n^3$ by the birthday paradox. In the sequel we will not distinguish between an item $x$ and its
hash value $h(x)$. Under this setting we obtain the following tradeoffs between $t_q$ and $t_u$.

Theorem 1 For any constant $c > 0$, suppose we insert a sequence of $n > \Omega(m \cdot b^{1+2c})$ random items into
an initially empty hash table. If the total cost of these insertions is expected $n \cdot t_u$ I/Os, and the hash table is
able to answer a successful query in expected average $t_q$ I/Os at any time, then the following tradeoffs hold:

1. If $t_q \le 1 + O(1/b^c)$ for any $c > 1$, then $t_u \ge 1 - O(1/b^{(c-1)/4})$;

2. If $t_q \le 1 + O(1/b)$, then $t_u \ge \Omega(1)$;

3. If $t_q \le 1 + O(1/b^c)$ for any $0 < c < 1$, then $t_u \ge \Omega(b^{c-1})$.

The abstraction. To abstractly model a dynamic hash table, we ignore any of its auxiliary structures and
focus only on the layout of items. Consider any snapshot of the hash table when we have inserted $k$ items.
We divide these $k$ items into three zones. The memory zone $M$ is a set of at most $m$ items that are kept
in memory. It takes no I/O to query any item in $M$. All items not in $M$ must reside on disk. Denote all
the blocks on disk by $B_1, B_2, \ldots, B_d$. Each $B_i$ is a set of at most $b$ items, and it is possible that one item
appears in more than one $B_i$. Let $f : U \to \{1, \ldots, d\}$ be any function computable within memory, and we
divide the disk-resident items into two zones with respect to $f$ and the set of blocks $B_1, \ldots, B_d$. The fast
zone $F$ contains all items $x$ such that $x \in B_{f(x)}$. These are the items that are accessible with just one I/O.
We allocate all the remaining items to the slow zone $S$: These items need at least two I/Os to locate. Note
that under random inputs, the sets $M, F, S, B_1, \ldots, B_d$ are all random sets.
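The three-zone abstraction can be illustrated with a small sketch. The function names `classify_zones` and `average_query_cost_lower_bound` and the toy data are hypothetical, introduced only for illustration; the sketch assumes items are integers and blocks are plain sets.

```python
from typing import Callable, Dict, Set, Tuple


def classify_zones(
    items: Set[int],
    memory: Set[int],
    blocks: Dict[int, Set[int]],
    f: Callable[[int], int],
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Split the inserted items into memory zone M, fast zone F, slow zone S."""
    M = items & memory
    # Fast zone: disk-resident items that sit in the block f points to.
    F = {x for x in items - M if x in blocks.get(f(x), set())}
    # Slow zone: everything else; these need at least two I/Os.
    S = items - M - F
    return M, F, S


def average_query_cost_lower_bound(M: Set[int], F: Set[int], S: Set[int]) -> float:
    """Charge 0 I/Os for M, 1 for F, and (at least) 2 for S, averaged over k items."""
    k = len(M) + len(F) + len(S)
    return (len(F) + 2 * len(S)) / k


# Toy snapshot: k = 6 items, one kept in memory, two disk blocks.
items = {0, 1, 2, 3, 4, 5}
memory = {0}
blocks = {1: {1, 2, 4}, 2: {3, 5}}
f = lambda x: 1 + x % 2  # an arbitrary in-memory address function
M, F, S = classify_zones(items, memory, blocks, f)
```

In this toy snapshot item 1 is stored in block 1 but `f` points to block 2, so it lands in the slow zone.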

Any query algorithm on the hash table can be modeled as described, since the only way to find a queried
item $x$ in one I/O is to compute the index of a block containing $x$ with only the information in memory. If the
memory-resident computation gives an incorrect address or anything else, at least 2 I/Os will be necessary.
Because any such $f$ must be computable within memory, and the memory has $m \log u$ bits, the hash table
can employ a family $\mathcal{F}$ of at most $2^{m \log u}$ distinct $f$'s. Note that the current $f$ adopted by the hash table
depends on the already inserted items, but the family $\mathcal{F}$ has to be fixed beforehand.

Suppose the hash table answers a successful query with an expected average cost of $t_q = 1 + \delta$ I/Os,
where $\delta = 1/b^c$ for any constant $c > 0$. Consider the snapshot of the hash table when $k$ items have been
inserted. Then we must have $E[|F| + 2 \cdot |S|]/k \le 1 + \delta$. Since $|F| + |S| = k - |M|$ and $E[|M|] \le m$, we
have

$$E[|S|] \le m + \delta k. \qquad (1)$$

We also have the following high-probability version of (1).

Lemma 1 Let $\phi \ge 1/b^{(c-1)/4}$ and let $k \ge \phi n$. At the snapshot when $k$ items have been inserted, with
probability at least $1 - 2\phi$, $|S| \le m + \frac{\delta}{\phi} k$.

Proof: On this snapshot the hash table answers a query in expected average $1 + \delta$ I/Os. We claim that
with probability at most $2\phi$, the average query cost is more than $1 + \delta/\phi$. Otherwise, since in any case the
average query cost is at least $1 - m/k$ (assuming all items not in memory need just one I/O), we would have
an expected average cost of at least

$$(1 - 2\phi)(1 - m/k) + 2\phi \cdot (1 + \delta/\phi) > 1 + \delta,$$

provided that $\frac{n}{m} > \frac{1}{\delta\phi}$, which is valid since we assume that $\frac{n}{m} > b^{1+2c}$. The lemma then follows from the
same argument used to derive (1). □
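As a sanity check on the inequality in the proof of Lemma 1, the expansion below is a worked step added for the reader; it uses only the symbols $\phi$, $\delta$, $m$, and $k$ already defined.

```latex
\begin{align*}
(1 - 2\phi)\Bigl(1 - \frac{m}{k}\Bigr) + 2\phi\Bigl(1 + \frac{\delta}{\phi}\Bigr)
  &= 1 - \frac{m}{k} - 2\phi + 2\phi\,\frac{m}{k} + 2\phi + 2\delta \\
  &= 1 + 2\delta - (1 - 2\phi)\,\frac{m}{k}.
\end{align*}
```

The right-hand side exceeds $1 + \delta$ whenever $m/k < \delta$. Since $k \ge \phi n$, this holds once $n/m > 1/(\delta\phi)$.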

Basic idea of the lower bound proof. For the first $\phi n$ items, we ignore the cost of their insertions.
Consider any $f : U \to \{1, \ldots, d\}$. For $i = 1, \ldots, d$, let $\alpha_i = |f^{-1}(i)|/u$, and we call $(\alpha_1, \ldots, \alpha_d)$ the
characteristic vector of $f$. Note that $\sum_i \alpha_i = 1$. For any one of the first $\phi n$ items, since it is randomly
chosen from $U$, $f$ will direct it to $B_i$ with probability $\alpha_i$. Intuitively, if $\alpha_i$ is large, too many items will be
directed to $B_i$. Since $B_i$ contains at most $b$ items, the extra items will have to be pushed to the slow zone.
If there are too many large $\alpha_i$'s, $S$ will be large enough to violate the query requirement. Thus, the hash
table should use an $f$ that distributes items relatively evenly to the blocks. However, if $f$ evenly distributes
the first $\phi n$ items, it is also likely to distribute newly inserted items evenly, leading to a high insertion cost.
Below we formalize this intuition.

For the first tradeoff of Theorem 1, we set $\delta = 1/b^c$. We also pick the following set of parameters:
$\phi = 1/b^{(c-1)/4}$, $\rho = 2b^{(c+3)/4}/n$, $s = n/b^{(c+1)/2}$. We will use different values for these parameters when
proving the other two tradeoffs. Given an $f$ with characteristic vector $(\alpha_1, \ldots, \alpha_d)$, let $D_f = \{i \mid \alpha_i > \rho\}$
be the collection of block indices with large $\alpha_i$'s. We say that the indices in $D_f$ form the bad index area and
the others form the good index area. Let $\lambda_f = \sum_{i \in D_f} \alpha_i$. Note that there are at most $\lambda_f/\rho$ indices in the bad
index area. We call an $f$ with $\lambda_f > \phi$ a bad function; otherwise it is a good function. The following lemma
shows that with high probability, the hash table should use a good function $f$ from $\mathcal{F}$.
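The definitions of the characteristic vector, the bad index area, and good versus bad functions can be made concrete on a tiny universe. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the universe is small enough to enumerate, and block indices may start at 0 rather than 1.

```python
from collections import Counter
from fractions import Fraction
from typing import Callable, Dict, Set, Tuple


def characteristic_vector(f: Callable[[int], int], u: int) -> Dict[int, Fraction]:
    """alpha_i = |f^{-1}(i)| / u, computed by enumerating the universe {0..u-1}."""
    counts = Counter(f(x) for x in range(u))
    return {i: Fraction(c, u) for i, c in counts.items()}


def bad_index_area(alpha: Dict[int, Fraction], rho: Fraction) -> Tuple[Set[int], Fraction]:
    """Return (D_f, lambda_f): indices with alpha_i > rho and their total mass."""
    D = {i for i, a in alpha.items() if a > rho}
    lam = sum((alpha[i] for i in D), Fraction(0))
    return D, lam


def is_good(f: Callable[[int], int], u: int, rho: Fraction, phi: Fraction) -> bool:
    """f is good when the bad index area carries mass at most phi."""
    _, lam = bad_index_area(characteristic_vector(f, u), rho)
    return lam <= phi


even = lambda x: x % 4                         # spreads the universe evenly
skew = lambda x: 0 if x < 6 else 1 + x % 4     # sends half the universe to index 0
rho, phi = Fraction(1, 3), Fraction(1, 4)      # illustrative thresholds only
D, lam = bad_index_area(characteristic_vector(skew, 12), rho)
```

Here `even` is a good function and `skew` is a bad one, and `len(D) <= lam / rho` reflects the bound of at most $\lambda_f/\rho$ bad indices.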

Lemma 2 At the snapshot when $k$ items are inserted for any $k \ge \phi n$, the function $f$ used by the hash table
is a good function with probability at least $1 - 2\phi - 1/2^{\Omega(b)}$.

Proof: Consider any bad function $f$ from $\mathcal{F}$. Let $X_j$ be the indicator variable of the event that the $j$-th
inserted item is mapped to the bad index area, $j = 1, \ldots, k$. Then $X = \sum_{j=1}^{k} X_j$ is the total number of
items mapped to the bad index area of $f$. We have $E[X] = \lambda_f k$. By the Chernoff inequality, using
$\lambda_f > \phi$ and $k \ge \phi n$, we have

$$\Pr\left[X < \tfrac{2}{3}\lambda_f k\right] \le e^{-(1/3)^2 \lambda_f k/2} \le e^{-\phi^2 n/18},$$

namely with probability at least $1 - e^{-\phi^2 n/18}$, we have $X \ge \frac{2}{3}\lambda_f k$. Since the family $\mathcal{F}$ contains at most $2^{m \log u}$
bad functions, by the union bound we know that with probability at least $1 - 2^{m \log u} \cdot e^{-\phi^2 n/18} \ge 1 - 1/2^{\Omega(b)}$ (by
the parameters chosen and the assumptions that $n > \Omega(m b^{1+2c})$ and $b > \log u$), for all the bad functions in $\mathcal{F}$,
we have $X \ge \frac{2}{3}\lambda_f k$.

Consequently, since the bad index area can only accommodate $b \cdot \lambda_f/\rho$ items in the fast zone, at least
$\frac{2}{3}\lambda_f k - b\lambda_f/\rho$ items cannot be in the fast zone. The memory zone can accept at most $m$ items, so the number of
items in the slow zone is at least

$$|S| \ge \frac{2}{3}\lambda_f k - b\lambda_f/\rho - m > m + \frac{\delta}{\phi} k.$$

This happens with probability at least $1 - 1/2^{\Omega(b)}$, due to the fact that $f$ is a bad function. On the other
hand, Lemma 1 states that $|S| \le m + \frac{\delta}{\phi} k$ holds with probability at least $1 - 2\phi$; thus by the union bound $f$ is a
good function with probability at least $1 - 2\phi - 1/2^{\Omega(b)}$. □



A bin-ball game. Lemma 2 enables us to consider only good functions $f$ after the initial $\phi n$
insertions. To show that any good function will incur a large insertion cost, we first consider the following
bin-ball game, which captures the essence of performing insertions using a good function.

In an $(s, p, t)$ bin-ball game, we throw $s$ balls into $r$ (for any $r \ge 1/p$) bins independently at random,
and the probability that any ball goes to any particular bin is no more than $p$. At the end of the game, an
adversary removes $t$ balls from the bins such that the remaining $s - t$ balls hit the least number of bins. The
cost of the game is defined as the number of nonempty bins occupied by the $s - t$ remaining balls.
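The game can be simulated directly. The sketch below is illustrative only: it assumes uniform throws into $r$ bins (so each bin is hit with probability $1/r \le p$), and the function names are hypothetical. The adversary's best strategy is to empty the lightest bins first, since emptying $j$ bins costs at least the sum of the $j$ smallest loads.

```python
import random
from collections import Counter
from typing import List, Optional


def bin_ball_cost(assignment: List[int], t: int) -> int:
    """Number of nonempty bins left after an optimal adversary removes t balls.

    assignment[j] is the bin that ball j landed in.
    """
    loads = sorted(Counter(assignment).values())
    nonempty = len(loads)
    # Emptying the lightest bins first frees the most bins per removed ball.
    for load in loads:
        if load > t:
            break
        t -= load
        nonempty -= 1
    return nonempty


def play(s: int, r: int, t: int, seed: Optional[int] = None) -> int:
    """Throw s balls uniformly into r bins, then let the adversary remove t balls."""
    rng = random.Random(seed)
    assignment = [rng.randrange(r) for _ in range(s)]
    return bin_ball_cost(assignment, t)
```

For example, with loads 3, 1, and 2 in three bins and $t = 3$, the adversary empties the bins holding 1 and 2 balls, leaving a cost of 1.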

We have the following two results with respect to such a game, depending on the relationships among
$s$, $p$, and $t$.
Lemma 3 If $sp < \frac{1}{3}$, then for any $\mu > 0$, with probability at least $1 - e^{-\mu^2 s/3}$, the cost of an $(s, p, t)$ bin-ball
game is at least $(1 - \mu)(1 - sp)s - t$.

Proof: Imagine that we throw the $s$ balls one by one. Let $X_j$ be the indicator variable denoting the event that
the $j$-th ball is thrown into an empty bin. The number of nonempty bins in the end is thus $X = \sum_{j=1}^{s} X_j$.
These $X_j$'s are not independent, but no matter what has happened previously for the first $j - 1$ balls, we
always have $\Pr[X_j = 0] \le sp$. This is because at any time, at most $s$ bins are nonempty. Let $Y_j$ ($1 \le j \le s$)
be a set of independent variables such that

$$Y_j = \begin{cases} 0, & \text{with probability } sp; \\ 1, & \text{otherwise.} \end{cases}$$

Let $Y = \sum_{j=1}^{s} Y_j$. Each $Y_j$ is stochastically dominated by $X_j$, so $Y$ is stochastically dominated by $X$. We
have $E[Y] = (1 - sp)s$ and we can apply the Chernoff inequality on $Y$:

$$\Pr[Y < (1 - \mu)(1 - sp)s] \le e^{-\mu^2 (1 - sp)s/2} \le e^{-\mu^2 s/3}.$$

Therefore with probability at least $1 - e^{-\mu^2 s/3}$, we have $X \ge (1 - \mu)(1 - sp)s$. Finally, since removing
$t$ balls will reduce the number of nonempty bins by at most $t$, the cost of the bin-ball game is at least
$(1 - \mu)(1 - sp)s - t$ with probability at least $1 - e^{-\mu^2 s/3}$. □

Lemma 4 If $s/2 \ge t$ and $s/2 \ge 1/p$, then with probability at least $1 - 1/2^{\Omega(s)}$, the cost of an $(s, p, t)$
bin-ball game is at least $1/(20p)$.

Proof: In this case, the adversary will remove at most $s/2$ balls in the end. Thus it suffices to show that with very
small probability, there exists a subset of $s/2$ balls all of which are thrown into a subset of at most $1/(20p)$
bins. Before the analysis, we merge bins such that the probability that any ball goes to any particular bin is
between $p/2$ and $p$, and consequently, the number of bins is between $1/p$ and $2/p$. Note that such
an operation will only make the cost of the bin-ball game smaller. Now this probability is at most

$$\binom{2/p}{1/(20p)} \binom{s}{s/2} \left(\frac{1/(20p)}{1/p}\right)^{s/2} \le 2^{-\Omega(s)},$$

hence the lemma. □
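To get a feel for how small the probability bound in the proof of Lemma 4 is, one can evaluate its base-2 logarithm for concrete values. This is a numeric illustration only, not part of the proof; the function name is hypothetical, and it assumes $1/p$ is an integer divisible by 20 and $s$ is even.

```python
import math


def log2_lemma4_bound(s: int, inv_p: int) -> float:
    """log2 of C(2/p, 1/(20p)) * C(s, s/2) * (1/20)^(s/2), with inv_p = 1/p."""
    bins = 2 * inv_p          # upper bound on the number of merged bins
    target = inv_p // 20      # size of the small subset of bins
    return (
        math.log2(math.comb(bins, target))
        + math.log2(math.comb(s, s // 2))
        - (s / 2) * math.log2(20)
    )


# s = 400 balls, p = 1/100, so s/2 >= 1/p as the lemma requires.
value = log2_lemma4_bound(400, 100)
```

For these values the logarithm is far below zero, consistent with a bound of the form $2^{-\Omega(s)}$.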

Now we are ready to prove the main theorem. 

Proof: (of Theorem 1) We begin with the first tradeoff. Recall that we choose the following parameters:
$\delta = 1/b^c$, $\phi = 1/b^{(c-1)/4}$, $\rho = 2b^{(c+3)/4}/n$, $s = n/b^{(c+1)/2}$. For the first $\phi n$ items, we do not count their
insertion costs. We divide the rest of the insertions into rounds, with each round containing $s$ items. We now
bound the expected cost of each round.

Focus on a particular round, and let $f$ be the function used by the hash table at the end of this round.
We only consider the set $R$ of items inserted in this round that are mapped to the good index area of $f$, i.e.,
$R = \{x \mid f(x) \notin D_f\}$; other items are assumed to have been inserted for free. Consider the block with
index $f(x)$ for a particular $x$. If $x$ is in the fast zone, the block $B_{f(x)}$ must contain $x$. Thus, the number of
distinct indices $f(x)$ for $x \in R \cap F$ is an obvious lower bound on the I/O cost of this round. Denote this
number by $Z = |\{f(x) \mid x \in R \cap F\}|$. Below we will show that $Z$ is large with high probability.

We first argue that at the end of this round, each of the following three events happens with high
probability.

• $\mathcal{E}_1$: $|S| \le \delta n/\phi + m$;

• $\mathcal{E}_2$: $f$ is a good function;

• $\mathcal{E}_3$: For all good functions $f \in \mathcal{F}$ and corresponding slow zones $S$ and memory zones $M$, $Z \ge
(1 - O(\phi))s - t$, where $t = |S| + |M|$.

By Lemma 1, $\mathcal{E}_1$ happens with probability at least $1 - 2\phi$. By Lemma 2, $\mathcal{E}_2$ happens with probability at
least $1 - 2\phi - 1/2^{\Omega(b)}$. It remains to show that $\mathcal{E}_3$ also happens with high probability.

We prove this by first claiming that for a particular $f \in \mathcal{F}$, with probability at least $1 - e^{-2\phi^2 s}$, $Z$ is at
least the cost of a $((1 - 2\phi)s, \frac{\rho}{1-\phi}, t)$ bin-ball game, for the following reasons:

1. Since $f$ is a good function, by the Chernoff inequality, with probability at least $1 - e^{-2\phi^2 s}$, more than
$(1 - 2\phi)s$ newly inserted items will fall into the good index area of $f$, i.e., $|R| > (1 - 2\phi)s$.

2. The probability of any item being mapped to any index in the good index area, conditioned on that it
goes to the good index area, is no more than $\frac{\rho}{1-\phi}$.

3. Only $t$ items in $R$ are not in the fast zone $F$; excluding them from $R$ corresponds to discarding $t$ balls
at the end of the bin-ball game.

Thus by Lemma 3 (setting $\mu = \phi$), with probability at least $1 - e^{-\phi^2 (1-2\phi)s/3} - e^{-2\phi^2 s}$, we have

$$Z \ge (1 - \phi)\left(1 - (1 - 2\phi)s \cdot \frac{\rho}{1-\phi}\right)(1 - 2\phi)s - t \ge (1 - O(\phi))s - t.$$

Thus $\mathcal{E}_3$ happens with probability at least $1 - \left(e^{-\phi^2 (1-2\phi)s/3} + e^{-2\phi^2 s}\right) \cdot 2^{m \log u} = 1 - 2^{-\Omega(b)}$ (by the
assumptions that $n > \Omega(m b^{1+2c})$ and $b > \log u$) by applying the union bound over all good functions in $\mathcal{F}$.

Now we lower bound the expected insertion cost of one round. By the union bound, with probability at
least $1 - O(\phi) - 1/2^{\Omega(b)}$, all of $\mathcal{E}_1$, $\mathcal{E}_2$, and $\mathcal{E}_3$ happen at the end of the round. By $\mathcal{E}_2$ and $\mathcal{E}_3$, we have
$Z \ge (1 - O(\phi))s - t$. Since now $t = |S| + |M| \le \delta n/\phi + 2m = O(\phi s)$ by $\mathcal{E}_1$, we have $Z \ge (1 - O(\phi))s$.
Thus the expected cost of one round will be at least

$$(1 - O(\phi))s \cdot \left(1 - O(\phi) - 1/2^{\Omega(b)}\right) = (1 - O(\phi))s.$$

Finally, since there are $(1 - \phi)n/s$ rounds, the expected amortized cost per insertion is at least

$$(1 - O(\phi))s \cdot (1 - \phi)n/s \cdot 1/n = 1 - O\left(1/b^{(c-1)/4}\right).$$
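Two identities among the first-tradeoff parameters are used implicitly above: $s\rho = 2\phi$ (so the bin-ball game satisfies the hypothesis of Lemma 3 for large $b$) and $\delta n/\phi = \phi s$ (so $t = O(\phi s)$). The sketch below checks them by tracking only the exponent of $b$ for each quantity; the function names are hypothetical, and constant factors and powers of $n$ are handled in the comments rather than in code.

```python
from fractions import Fraction
from typing import Tuple


def first_tradeoff_exponents(c: Fraction) -> Tuple[Fraction, Fraction, Fraction, Fraction]:
    """Exponents of b in delta, phi, rho, s for the first tradeoff."""
    delta = -c                 # delta = b^(-c)
    phi = -(c - 1) / 4         # phi   = b^(-(c-1)/4)
    rho = (c + 3) / 4          # rho   = 2 * b^((c+3)/4) / n
    s = -(c + 1) / 2           # s     = n * b^(-(c+1)/2)
    return delta, phi, rho, s


def check_first_tradeoff(c: Fraction) -> bool:
    delta, phi, rho, s = first_tradeoff_exponents(c)
    # s * rho = 2 * b^(rho + s): the n factors cancel, leaving 2 * phi.
    s_rho_matches_phi = (rho + s == phi)
    # delta * n / phi and phi * s both equal n * b^(-(3c+1)/4).
    t_matches_phi_s = (delta - phi == phi + s)
    return s_rho_matches_phi and t_matches_phi_s
```

Both identities hold for every $c$, which is why the same proof template works for any constant $c > 1$.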

For the second tradeoff, we choose the following set of parameters: $\phi = 1/\kappa$, $\rho = 2\kappa b/n$, $s = n/(\kappa^2 b)$
and $\delta = 1/(\kappa^4 b)$ (for some constant $\kappa$ large enough). We can check that Lemma 2 still holds with these
parameters, and then go through the proof above. We omit the tedious details. Plugging the new parameters
into the derivations we obtain a lower bound $t_u \ge \Omega(1)$.

For the third tradeoff, we choose the following set of parameters: $\phi = 1/8$, $\rho = 16b/n$, $s = 32n/b^c$ and
$\delta = 1/b^c$. We can still check the validity of Lemma 2, and go through the whole proof. The only difference
is that we need to use Lemma 4 in place of Lemma 3, the reason being that for our new set of parameters,
we have $s\rho = \omega(1)$ and thus Lemma 3 does not apply. By using Lemma 4 we can lower bound the expected
insertion cost of each round by $\Omega((1 - 2\phi)/(20\rho))$, so the expected amortized insertion cost is at least

$$\Omega\left(\frac{1 - 2\phi}{20\rho}\right) \cdot (1 - \phi)n/s \cdot 1/n = \Omega\left(b^{c-1}\right),$$

as claimed. □



3 Upper Bounds 

In this section, we present a simple dynamic hash table that supports insertions in $t_u = O(b^{c-1}) = o(1)$
I/Os amortized, while being able to answer a query in expected $t_q = 1 + O(1/b^c)$ I/Os on average for any
constant $c < 1$, under the mild assumption that $\log \frac{n}{m} = o(b)$. Below we first state a folklore result obtained by
applying the logarithmic method [5] to a standard hash table, achieving $t_u = o(1)$ but with $t_q = \Omega(1)$. Then
we show how to improve the query cost to $1 + O(1/b^c)$ while keeping the insertion cost at $o(1)$. We also
show how to tune the parameters such that $t_u = \epsilon$ while $t_q = 1 + O(1/b)$, for any constant $\epsilon > 0$.

Applying the logarithmic method. Fix a parameter $\gamma \ge 2$. We maintain a series of hash tables $H_0, H_1, \ldots$
The hash table $H_k$ has $\gamma^k \cdot \frac{m}{b}$ buckets and stores up to $\frac{1}{2}\gamma^k m$ items, so that its load factor is always at most
$\frac{1}{2}$. It uses the $\log(\gamma^k \cdot \frac{m}{b}) = k \log \gamma + \log \frac{m}{b}$ least significant bits of the hash function $h(x)$ to assign items
into buckets. We use some standard method to resolve collisions, such as chaining. The first hash table $H_0$
always resides in memory while the rest stay on disk.

When a new item is inserted, it always goes to the memory-resident $H_0$. When $H_0$ is full (i.e., having
$\frac{1}{2}m$ items), we migrate all items stored in $H_0$ to $H_1$. If $H_1$ is not empty, we simply merge the corresponding
buckets. Note that each bucket in $H_0$ corresponds to $\gamma$ consecutive buckets in $H_1$, and we can easily
distribute the items to their new buckets in $H_1$ by looking at $\log \gamma$ more bits of their hash values. Thus we
can conduct the merge by scanning the two tables in parallel, costing $O(\gamma \cdot \frac{m}{b})$ I/Os at most. This operation
takes place inductively: Whenever $H_k$ is full, we migrate its items to $H_{k+1}$, costing $O(\gamma^{k+1} \cdot \frac{m}{b})$ I/Os. Then
standard analysis shows that for $n$ insertions, the total cost is $O(\frac{\gamma n}{b} \log \frac{n}{m})$ I/Os, or $O(\frac{\gamma}{b} \log \frac{n}{m})$ amortized
I/Os per insertion. However, for a query we need to examine all the $O(\log_\gamma \frac{n}{m})$ hash tables.
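The migration schedule of the logarithmic method can be sketched as a small simulation. This is an illustrative model under simplifying assumptions, not the structure's actual implementation: each level is a plain Python set rather than a bucketed hash table, items are assumed distinct, a migration into $H_{k+1}$ is charged $\gamma^{k+1} \cdot m/b$ I/Os for the scan, and the class name `LogMethodTable` is hypothetical.

```python
from typing import List, Set, Tuple


class LogMethodTable:
    """Toy model of the logarithmic method: levels[0] plays the in-memory H_0."""

    def __init__(self, m: int, b: int, gamma: int = 2) -> None:
        self.m, self.b, self.gamma = m, b, gamma
        self.levels: List[Set[int]] = [set()]
        self.ios = 0  # I/Os charged for migrations so far

    def capacity(self, k: int) -> int:
        # H_k stores up to (1/2) * gamma^k * m items.
        return (self.gamma ** k) * self.m // 2

    def insert(self, x: int) -> None:
        self.levels[0].add(x)
        k = 0
        # Cascade: whenever H_k is full, migrate all of it into H_{k+1}.
        while len(self.levels[k]) >= self.capacity(k):
            if k + 1 == len(self.levels):
                self.levels.append(set())
            self.levels[k + 1] |= self.levels[k]
            self.levels[k].clear()
            # Charge one scan of H_{k+1}: gamma^(k+1) * m / b blocks.
            self.ios += (self.gamma ** (k + 1)) * self.m // self.b
            k += 1

    def lookup(self, x: int) -> Tuple[bool, int]:
        """Return (found, number of disk-resident tables probed)."""
        probes = 0
        for k, level in enumerate(self.levels):
            if k > 0:
                probes += 1
            if x in level:
                return True, probes
        return False, probes


table = LogMethodTable(m=8, b=4, gamma=2)
for item in range(64):
    table.insert(item)
```

The simulation makes the tradeoff visible: insertions are cheap in aggregate, but a lookup may have to probe every level.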

Lemma 5 For any parameter $\gamma \ge 2$, there is a dynamic hash table that supports an insertion in amortized
$O(\frac{\gamma}{b} \log \frac{n}{m})$ I/Os and a (successful or unsuccessful) lookup in expected average $O(\log_\gamma \frac{n}{m})$ I/Os.

Improving the query cost. Next we show how to improve the average cost of a successful query to
$1 + O(1/b^c)$ I/Os for any constant $c < 1$, while keeping the insertion cost at $o(1)$. The idea is to try to put
the majority of the items into one single big hash table. In the standard logarithmic method described above,
the last table may seem a good candidate, but sometimes it may only contain a constant fraction of all items.
Below we show how to bootstrap the structure above to obtain a better query bound.

Fix a parameter $2 \le \beta \le b$. For the first $m$ items inserted, we dump them in a hash table $H$ on disk.
Then we run the algorithm above for the next $m/\beta$ items. After that we merge these $m/\beta$ items into $H$. We
keep doing so until the size of $H$ has reached $2m$, and then we start the next round. Generally, in the $i$-th
round, the size of $H$ goes from $2^{i-1}m$ to $2^i m$, and we apply the algorithm above for every $2^{i-1}m/\beta$ items.
It is clear that $H$ always has at least a fraction of $1 - \frac{1}{\beta}$ of all the items inserted so far, while the series
of hash tables used in the logarithmic method maintain at least a separation factor of 2 in the sizes between
successive tables. Thus, the expected average query cost is at most

$$\left(1 + 1/2^{\Omega(b)}\right)\left(1 \cdot \left(1 - \frac{1}{\beta}\right) + \frac{1}{\beta}\left(2 \cdot \frac{1}{2} + 3 \cdot \frac{1}{4} + \cdots\right)\right) = 1 + O(1/\beta).$$

Next we analyze the amortized insertion cost. Since the number of items doubles every round, it is
(asymptotically) sufficient to analyze the last round. In the last round, $H$ is scanned $\beta$ times, and we charge
$O(\beta/b)$ I/Os to each of the $n$ items. The logarithmic method is invoked $\beta$ times, but every invocation handles
$O(n/\beta)$ different items. From Lemma 5, the amortized cost per item is still $O(\frac{\gamma}{b} \log \frac{n}{m})$ I/Os. So the total
amortized cost per insertion is $O(\frac{1}{b}(\beta + \gamma \log \frac{n}{m}))$ I/Os. Let the constant in this big-O be $c'$. Then setting
$\beta = b^c$ (or respectively $\beta = \frac{\epsilon}{2c'} \cdot b$) and $\gamma = 2$ yields the desired results, as long as $\log \frac{n}{m} = o(b)$.
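The two cost expressions for the bootstrapped structure can be evaluated numerically. The sketch below is a back-of-the-envelope calculator under stated assumptions: it drops the $1 + 1/2^{\Omega(b)}$ factor and all big-O constants, treats the worst case where exactly a $1/\beta$ fraction of items sits outside $H$ with table sizes halving, and uses hypothetical function names.

```python
import math


def expected_query_cost(beta: float, levels: int = 60) -> float:
    """Fraction 1 - 1/beta of items costs 1 probe; the j-th largest
    logarithmic-method table holds a 1/2^j share of the rest and costs j + 1."""
    tail = sum((j + 1) / 2 ** j for j in range(1, levels + 1))
    return (1 - 1 / beta) + tail / beta


def insertion_cost(b: int, beta: float, gamma: int, n_over_m: int) -> float:
    """(beta + gamma * log2(n/m)) / b, with the big-O constant dropped."""
    return (beta + gamma * math.log2(n_over_m)) / b


# Illustrative values only: b = 1024, beta = b^(1/2) = 32, n/m = 2^20.
q = expected_query_cost(32)
u = insertion_cost(1024, 32, 2, 2 ** 20)
```

The series $2 \cdot \frac{1}{2} + 3 \cdot \frac{1}{4} + \cdots$ sums to 3, so the query estimate is $1 + 2/\beta$, while the insertion estimate stays well below one I/O.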

Theorem 2 For any constant $c < 1$, $\epsilon > 0$, there is a dynamic hash table that supports an insertion in
amortized $O(b^{c-1})$ I/Os and a successful lookup in expected average $1 + O(1/b^c)$ I/Os, or an insertion in
amortized $\epsilon$ I/Os and a successful lookup in expected average $1 + O(1/b)$ I/Os, provided that $\log \frac{n}{m} = o(b)$.